Solution: Use the Right Tool for the Job
Let’s explore the alternative technologies offered in the different database brands for search.
It’s best to use specialized search engine technology instead of SQL. Another way to optimize is to reduce the recurring cost of search by saving the results. The following sections describe some of the technologies offered as built-in extensions by different database brands, as well as technologies offered by independent projects. We’ll also develop a solution that uses standard SQL but is more efficient on average than substring matching.
Vendor extensions#
Every major brand of a database has invented its own answer to the common requirement of full-text search, but these features are not standard or compatible between database brands. If we use a single brand (or are willing to use vendor-dependent features), these features are the best way to get high-performance text search, with the greatest integration with SQL queries.
The following are brief descriptions of full-text search features in several brands of SQL databases. The details are subject to change, so we must always read the current documentation for our brand.
Full-text index in MySQL#
MySQL provides a simple full-text index type for the MyISAM storage engine. We can define a full-text index over columns of type CHAR
, VARCHAR
, or TEXT
. Here’s an example that defines a full-text index that includes content from the bug summary
and description
columns:
Let’s use the MATCH()
function to search for a keyword among the indexed text. We must name the columns in the full-text index (so we can match using another index that covers different columns in the same table).
Since MySQL 4.1, we can also use a simple Boolean expression notation in the pattern to filter results more carefully.
Text indexing in Oracle#
Oracle had supported text-indexing features since Oracle 8 in 1997 when it was part of a data cartridge called ConText. The technology has been updated several times, and the feature is now integrated into the database software. The text indexing in Oracle is complex and rich, so here is a greatly simplified summary:
CONTEXT
#
We can create an index of type CONTEXT
for a single text column. We can use the CONTAINS()
operator to search using this index. The index doesn’t stay consistent with changes to data unless we define the index with PARAMETERS ('SYNC (ON COMMIT)')
.
CTXCAT
#
The CTXCAT
index type is specialized for short text samples such as those used in online catalogs, along with other structured columns from the same table. The index stays consistent as transactions update the indexed data.
The CATSEARCH()
operator takes two arguments for searching the text column and the structured column set, respectively.
CTXXPATH
#
The CTXXPATH
index type is specialized for searching an XML document with the existsNode()
operator.
CTXRULE
#
Let’s suppose we have a large collection of documents in our database, and we need to classify them based on their content.
With the CTXRULE
index, we can design rules to analyze documents and report their classification. Alternatively, we can also provide a sample set of documents with classifications and have Oracle design the rules to apply to the rest of the document collection. We can even fully automate the process, letting Oracle analyze our collection of documents and come up with a set of rules to classify them.
Full-text search in Microsoft SQL Server#
SQL Server 2000 and later support full-text searching, with complex configuration options for languages, a thesaurus, and automatic synchronization with data changes. SQL Server provides a series of stored procedures for creating full-text indexes, and we can use the CONTAINS()
operator in queries to employ them.
To perform the familiar task of searching for bugs that include the word “crash”, we must first enable the full-text feature and define a catalog in our database:
Next, we need to define a full-text index on the Bugs
table, add columns to the index, and activate the index:
The next step is to enable automatic synchronization for the full-text index so that changes to the indexed column are propagated to the index. Then, we can begin the process of populating the index. This will run in the background, so it may take some time to complete before queries benefit from the index.
Finally, we must run a query using the CONTAINS()
operator:
Text search in PostgreSQL#
PostgreSQL 8.3 provides a sophisticated and highly configurable way of converting text into a searchable collection of lexical elements and of matching these documents against patterns.
To achieve the best performance, we need to store the content in its original text form as well as in a searchable form using the special data type TSVECTOR
.
We need to make sure that the TSVECTOR
column is kept in sync with the content in the text column(s) we want to make searchable. PostgreSQL provides a built-in trigger procedure to make this easier:
We should also create a generalized inverted index (GIN) on the TSVECTOR
column:
After this, we can use the PostgreSQL text search operator @@
to search efficiently, aided by the full-text index:
/
- main.sql
There are many other options for customizing searchable content, search queries, and search results.
Full-text search (FTS) in SQLite#
Standard tables in SQLite don’t support efficient full-text searches, but we can use an optional extension for SQLite to store searchable text in a virtual table specialized for searching text. Three versions of the searchable text extension exist, known as FTS1, FTS2, and FTS3.
FTS extensions are not typically enabled in a default build of SQLite, so we need to build it from source with one of the FTS extensions enabled. For example, we can add the following options to Makefile.in
, and then build SQLite.
Once we have a version of SQLite with FTS enabled, we can create a virtual table for the searchable text. Any data type, constraints, or other column options are ignored.
If we are indexing text from another table (as in this example using the Bugs
table), we must copy the data into the virtual table. The FTS virtual table always contains a primary key column called docid
, so we can correlate rows to those in a source table.
Now we can query the FTS virtual table BugsText
using the efficient full-text search predicate MATCH
, and we can join matching rows to the source table Bugs
. Using the name of the FTS table as a pseudo-column matches the pattern against any column.
The matching pattern also supports limited boolean expressions.